Abstract



XSLT 2.0 provides a wide range of new features, many of which make light work of tasks that
are notoriously difficult in XSLT 1.0, such as grouping and string manipulation. This paper
attempts to show how these facilities not only make coding easier, but will also extend the scope
of the language, making it possible to tackle problems that were quite outside the range of XSLT
1.0.

The paper presents a case study of a multi-phase transformation taking data from a legacy
ASCII-based interchange format to XML based on a standardized vocabulary. The transformations
illustrate the power of new features including regular expression handling, grouping, recursive
functions, and schema-aware processing.

The conclusion of the paper is that these new facilities - notably regular expression handling and 
grouping - take XSLT into new territory, where languages such as Perl previously reigned 
supreme. XSLT 1.0 works best where all the structure in a document is already identified by 
markup. XSLT 2.0 will also be able to handle many situations where the structure is implicit in 
the text, or in markup designed for presentation purposes rather than to capture the information 
semantics. It thus becomes a powerful tool for "up-conversion" applications. These facilities
work well in conjunction with schema-aware processing, where the aim of the exercise is to
create XML that conforms to a target schema.

1. Introduction

XSLT 1.0 became a W3C Recommendation in November 1999; it has attracted at least twenty 
implementations and a very sizeable user base. It is used mainly for two distinct applications: rendering of 
XML documents by converting them into a presentation-oriented vocabulary (usually HTML, sometimes 
XSL-FO, XHTML, or SVG); and conversion of data-oriented XML messages, either into a different 
vocabulary, or to a different document using the same vocabulary, but with different information content. 
Within these two categories there are some highly creative and innovative applications, a notable example 
being Schematron, which uses XSLT transformations to apply structural and semantic validation rules to a 
document. 

Although XSLT 1.0 is designed to transform source XML trees into result XML trees, it also includes 
three serialization methods, allowing the result tree to be output either as lexical XML, HTML, or text. 
This enables a wide range of applications in which the output is in textual form: I have seen XSLT 
stylesheets that generated Java programs, SQL code, comma-separated-values files, and EDI messages. 

However, this ability to generate multiple output formats is not mirrored on the input side. XSLT 1.0 has 
very little capability to take anything other than XML as its input. There are ways around this: for example 
in the first edition of my book XSLT Programmer's Reference I showed how one could write a parser for a
non-XML format such as the GEDCOM 5.5 format used for genealogical data, and by making this parser
implement the SAX interface supported by many XSLT processors, one could present the parsed input data
to the XSLT 1.0 processor as if it came from an XML parser. However, this is really only a minor
improvement on what can be achieved by writing a GEDCOM-to-XML converter as a standalone
application.

XSLT 2.0, as I will show in this paper, greatly extends the ability of XSLT to process any textual input, 
without the need to write conversion code in Java or another procedural programming language. It 
therefore enables XSLT to be used not only for XML-to-XML and XML-to-text applications, but also for 
text-to-XML conversions. More generally, it allows XSLT 2.0 to be used for up-conversion. 

In the broadcasting industry, the term upconversion (usually without a hyphen) is used to mean the 
conversion of a low-frequency video format to an equivalent high-frequency format. In the SGML and 
XML world, the word refers to the generation of a format with detailed markup from a format with less- 
detailed or no markup, where it is necessary to generate the additional markup by recognizing structural 
patterns that are implicit in the textual content itself. By extension the term is also used for converting non- 
SGML or non-XML markup into SGML/XML: this usage is justified, of course, on the basis that 
SGML/XML is obviously on a higher plane than any alternative markup language! 

I will start this paper with a survey of the new features in XSLT 2.0 that make it easier to write up- 
conversion transforms (it really doesn't make much sense to call them stylesheets any more, but I will slip 
into that usage occasionally). I will then present a case study of a particular up-conversion. I will use the 
example I mentioned earlier, conversion of GEDCOM genealogical data: but this time, the entire job will 
be done in XSLT 2.0, with no need to write preprocessing software in a procedural language. 



2. Up-Conversion Facilities in XSLT 2.0 

In this section I will describe how four of the new features in XSLT 2.0 can be used to assist in writing up- 
conversion applications. The four features discussed are: 

• The unparsed-text() function

• Regular expression processing

• Grouping facilities

• Schema-aware processing

The descriptions here are brief introductions to these facilities: for full information, see the W3C
specifications of XSLT 2.0 [XSLT 2.0] and XPath 2.0 [XPath 2.0], or my books XSLT 2.0 Programmer's
Reference [Kay, 2004a] and XPath 2.0 Programmer's Reference [Kay, 2004b].



2.1 The unparsed-text() function

In order to handle non-XML input, the first thing a stylesheet needs to be able to do is to read it. For this
purpose, XSLT 2.0 provides the unparsed-text() function. This takes a URI as its first argument, and loads
the text of the resource found at that URI. The result is a character string - that is, a value of type
xs:string, where "xs" is the XML Schema namespace. The type system of XSLT 2.0 is based on the
types defined in the XML Schema specification.

In fact, it was already possible in XSLT 1.0 to provide a stylesheet with non-XML input, in the form of a 
string-valued stylesheet parameter (parameters can be declared using a global <xsl:param> element). 
However, this imposes constraints: for example, it is difficult to handle a variable number of such inputs.
Allowing URI-addressable resources to be accessed directly makes the job much easier.

Character encoding is of course a problem. The unparsed-text() function allows a second parameter to 
specify the character encoding explicitly, or it can be guessed from external information - the XSLT 2.0 
spec refers to the algorithms and heuristics defined in the XLink specification for this purpose. In practice, 
if the file is an arbitrary file in operating system filestore with no associated metadata, guessing its 
encoding is sometimes going to give wrong answers. Sadly, there is no easy solution to this difficulty. 

The fact that the result of the unparsed-text() function must be an xs:string imposes a constraint: the
only characters allowed in the file are those permitted in XML documents. This same constraint also 
applies to any text output produced by a stylesheet. It means that XSLT is now capable of reading textual 
input and writing textual output, but it cannot be used to handle binary input or binary output, unless these 
are first translated into some textual representation. 
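
As a minimal sketch of the points above, the following stylesheet reads a text resource with the encoding stated explicitly rather than guessed. The parameter names input and encoding, the default of utf-8, and the report element are illustrative assumptions, not part of any standard vocabulary. The unparsed-text-available() function used for the guard is defined in the final XSLT 2.0 Recommendation; it lets the stylesheet emit a message instead of failing with a dynamic error when the resource cannot be read or decoded.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameters: the URI of the text file and its encoding. -->
  <xsl:param name="input" as="xs:string" required="yes"/>
  <xsl:param name="encoding" as="xs:string" select="'utf-8'"/>

  <!-- Entry point: invoked as a named template, since there is no source document. -->
  <xsl:template name="main">
    <xsl:choose>
      <!-- Check that the resource can be read and decoded with this encoding. -->
      <xsl:when test="unparsed-text-available($input, $encoding)">
        <xsl:variable name="text" as="xs:string"
                      select="unparsed-text($input, $encoding)"/>
        <!-- Report the number of characters read. -->
        <report chars="{string-length($text)}"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:message>Cannot read <xsl:value-of select="$input"/></xsl:message>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

A relative URI supplied in the input parameter is resolved against the base URI of the stylesheet, so an absolute URI is usually the safer choice.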

2.2 Regular expression processing 

XSLT 1.0 has been much criticized for its rather primitive text-handling capabilities: the function library 
provided for string handling in XPath 1.0 is designed very much on "reduced instruction set computing"
principles - you can achieve pretty well anything, but the complexity of the programming needed even for 
some quite simple tasks can be daunting. In particular, for many users (whether or not they have a 
programming background), writing string manipulation routines in terms of recursive templates can 
present a big conceptual barrier. 

I don't know the history of the decisions that brought this situation about. I have always thought the 
statement at the start of the XSLT 1.0 specification, to the effect that XSLT is not a general-purpose 
programming language, was very suggestive: committees don't put a statement like that in a specification 
unless there has been a vigorous debate on the matter, and the fact that the statement is there means there 
must have been a strong "keep it simple" camp on the working group who won the debate. Which is 
probably a good thing, given the length of time the world has been waiting for an XQuery 
recommendation. 

But the fact is, there is a large class of applications for which the text processing capability in XSLT 1.0 is
woefully inadequate - and this includes most up-conversion applications. XSLT 1.0 is very good at
performing structural transformations - that is, at rearranging the nodes in a tree. It is much less good at
manipulating the textual content of those nodes. By definition, up-conversion applications are those where
the input doesn't have explicit structure, but rather has structure that is implicit in the text, and therefore
they need good text processing capability.

Users of Perl and similar languages have long been accustomed to the power of regular expressions 
(regexes). In fact, they are so powerful they can become addictive: whereas programmers from other 
disciplines might turn to regular expressions as a last resort, there are Perl programmers who see almost 
any problem as an opportunity for creativity in their use of regexes. 

XPath 2.0 offers three functions in its standard function library that perform regular expression processing. 
Specifically: 

• matches(): returns a boolean value indicating whether a particular string matches a regular 
expression. 

• replace(): replaces those substrings within a given string that match a regular expression, with a 
replacement string. 

• tokenize(): breaks a string into a sequence of substrings, based on finding delimiters or separators 
that match a given regular expression. 
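
The three functions can be sketched together in one small stylesheet. The input string, the element names, and the regular expressions below are illustrative assumptions chosen to resemble a line of the GEDCOM data discussed later; the stylesheet is started at the named template main and needs no source document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Illustrative input: one GEDCOM-style line. -->
  <xsl:variable name="line" as="xs:string" select="'2 DATE 11 OCT 1951'"/>

  <xsl:template name="main">
    <demo>
      <!-- matches(): does the line begin with a level number followed by whitespace? -->
      <starts-with-level>
        <xsl:value-of select="matches($line, '^[0-9]+\s')"/>
      </starts-with-level>
      <!-- replace(): substitute '#' for every run of digits. -->
      <masked>
        <xsl:value-of select="replace($line, '[0-9]+', '#')"/>
      </masked>
      <!-- tokenize(): split on runs of whitespace, one element per token. -->
      <xsl:for-each select="tokenize($line, '\s+')">
        <token><xsl:value-of select="."/></token>
      </xsl:for-each>
    </demo>
  </xsl:template>

</xsl:stylesheet>
```

For this input the first element contains true, the second contains # DATE # OCT #, and five token elements are produced (2, DATE, 11, OCT, 1951).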

Conspicuously missing from this list is any function that allows markup to be inserted into a string. It can
be done somewhat laboriously by combining the different functions together, but using these three
functions alone to translate See [2] into See <ref>2</ref> is painfully hard work. The reason for the
omission is that it's hard to solve the requirement with a simple function.

The XSLT/XQuery/XPath programming model, despite the fact that it owes a great deal to functional
programming theory, does not support higher-order functions. That is, functions are not first-class objects
and cannot be supplied as arguments to other functions. This greatly limits the power of what can be
achieved with a function library alone. All higher-order capabilities in the three languages are instead
achieved by means of higher-order operators, custom syntax, or XSLT instructions. An example is the
XPath for expression, which in a pure functional language would be expressed as a higher-order map or
apply operator taking a sequence as its first argument and a function (to be applied to each member of the
sequence) as its second argument; another example is the construct seq[p], which is essentially a
higher-order filter function that takes a sequence as its first argument and a predicate as its second.
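
The two constructs just mentioned can be seen side by side in the following sketch. The sequence of GEDCOM tag names and the element names lengths and short are purely illustrative.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template name="main">
    <demo>
      <!-- The "for" expression plays the role of map:
           string-length() is applied to each item in the sequence. -->
      <lengths>
        <xsl:value-of
          select="for $w in ('NAME', 'SEX', 'BIRT', 'OCCU') return string-length($w)"/>
      </lengths>
      <!-- The predicate plays the role of filter:
           only items shorter than four characters are kept. -->
      <short>
        <xsl:value-of
          select="('NAME', 'SEX', 'BIRT', 'OCCU')[string-length(.) lt 4]"/>
      </short>
    </demo>
  </xsl:template>

</xsl:stylesheet>
```

In both cases the "function argument" (string-length($w), or the predicate) is written inline as an expression: it cannot be named and passed around, which is exactly the limitation described above.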

So the XSLT solution to this problem is an instruction, xsl:analyze-string, that logically takes four
arguments: the string to be analyzed, a regex, an instruction to be executed to process substrings that match
the regular expression, and an instruction to be executed to process substrings that don't match. The earlier
example that turns See [2] into See <ref>2</ref> can then be coded as follows:

<xsl:analyze-string select="$input" regex="\[.*?\]">
  <xsl:matching-substring>
    <ref><xsl:value-of select="translate(., '[]', '')"/></ref>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

Those who are comfortable with regular expressions will have little difficulty following what
regex="\[.*?\]" does: \[ matches an opening square bracket, .* matches any sequence of characters,
the ? is a modifier indicating that the .* should match the shortest possible sequence of characters
consistent with the regex as a whole succeeding, and the \] matches a closing square bracket.

The semantics of xsl:analyze-string are that the input string is scanned from left to right looking for
substrings that match the regex. Substrings that don't match the regex are passed (as the context item, ".")
to the xsl:non-matching-substring instruction, which in this case copies them unchanged, while
substrings that do match the regex are passed to xsl:matching-substring, which in this example wraps
the substring in a ref element, using the (XSLT 1.0) translate() function to drop the delimiting square
brackets. (Regex devotees will find a different way of doing this, but the old translate() function suits me
fine.)
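
One such different way, sketched here as an assumption about what a regex devotee might write, is to put the bracketed text in a parenthesized subexpression and retrieve it with the regex-group() function, so that no translate() call is needed. The default value of the input parameter and the para wrapper element are illustrative only.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Illustrative default input containing two bracketed references. -->
  <xsl:param name="input" as="xs:string" select="'See [2] and [17]'"/>

  <xsl:template name="main">
    <para>
      <!-- Group 1 captures the text between the square brackets. -->
      <xsl:analyze-string select="$input" regex="\[(.*?)\]">
        <xsl:matching-substring>
          <ref><xsl:value-of select="regex-group(1)"/></ref>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Text outside the brackets is copied unchanged. -->
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </para>
  </xsl:template>

</xsl:stylesheet>
```

With the default input this produces a para element whose content is See <ref>2</ref> and <ref>17</ref>.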

There is no equivalent facility to xsl:analyze-string in XQuery. In the latest release (version 8.1) of
Saxon I have introduced an extension to support higher-order functions, and have used this to provide an
extension function saxon:analyze-string() [Saxonica, 2004] that takes as its arguments the string to be
processed, the regex, and two functions to be applied to the matching and non-matching substrings
respectively. It's not quite as convenient to use as the XSLT 2.0 construct, but it demonstrates that if
higher-order functions were available in the language, there would be a lot less need for custom syntax to
solve such problems.

2.3 Grouping facilities 

Grouping problems probably form the largest category of tricky-to-solve problems faced by XSLT 1.0 
users. I classify any problem as a grouping problem if it requires the addition of an extra layer of hierarchy 
in the result tree that is not present in the source tree. Grouping problems fall essentially into two 
categories: those that group elements having matching data values, and those that group elements based on 
their position in a sequence (for example, a heading element followed by all the para elements up to the
next heading).

XSLT 1.0 offers no inbuilt support for solving grouping problems, and neither does XQuery 1.0. The 
standard solution for value-based grouping in XSLT 1.0 is a technique using keys, which was invented by
Steve Muench of Oracle and is therefore known as Muenchian grouping: its best description is that by Jeni
Tennison at [Tennison]. (Steve never published it himself: he first described it in a personal email to me,
and I announced his discovery to the world. I am very pleased that he got the credit he deserved, which is 
unusual in our industry.) For positional grouping, a number of techniques are possible, generally involving 
recursive processing using the following-sibling axis. (Unfortunately neither keys nor the following-sibling 
axis are available in XQuery, so XQuery users are going to struggle with this one.) 

XSLT 2.0 offers a new instruction, xsl:for-each-group, to perform grouping. It provides four ways to
define the grouping criterion: simple value-based grouping (the most common requirement) can be
achieved by defining an expression to compute the grouping key, while the other three options define
various kinds of positional grouping criteria. The body of the xsl:for-each-group instruction is then
executed once for each group of nodes identified.
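
As a sketch of one of the positional options, the following stylesheet handles the heading-and-para case mentioned earlier, using the group-starting-with attribute. The flat sample content is held in a variable so that the sketch runs without a source document; the sample text and the section element with its title attribute are illustrative assumptions.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Illustrative flat input: headings and paragraphs as siblings. -->
  <xsl:variable name="flat">
    <heading>Introduction</heading>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
    <heading>Conclusion</heading>
    <para>Last paragraph.</para>
  </xsl:variable>

  <xsl:template name="main">
    <body>
      <!-- Each heading element starts a new group; the paras that follow it,
           up to the next heading, belong to the same group. -->
      <xsl:for-each-group select="$flat/*" group-starting-with="heading">
        <section title="{current-group()[1]}">
          <!-- Copy every member of the group except the heading itself. -->
          <xsl:copy-of select="current-group()[position() gt 1]"/>
        </section>
      </xsl:for-each-group>
    </body>
  </xsl:template>

</xsl:stylesheet>
```

The result contains two section elements, the first holding two para children and the second holding one: a layer of hierarchy that was only implicit in the order of the input elements.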



Up-conversion using XSLT 2.0
http://www.idealliance.org/proceedings/xml04/papers/111/mhk-paper.html



To take a simple example, the following code takes a flat list of author elements, and groups them so that 
authors with the same affiliation appear as children of an affiliation element: 

<xsl:for-each-group select="author" group-by="affiliation">
  <affiliation name="{current-grouping-key()}">
    <xsl:copy-of select="current-group()"/>
  </affiliation>
</xsl:for-each-group>

What is the relevance of this to up-conversion, the subject of this paper? The answer is that up-conversion 
involves detection of implicit structure, and replacement of the implicit structure by explicit markup. This 
is exactly what grouping facilities are doing. This time, the implicit structure is not found by parsing the 
text, but by looking for patterns in the existing markup. This will become very clear in my case study, 
presented in the second half of this paper. 

Like xsl:analyze-string, the xsl:for-each-group instruction is essentially syntactic sugar for a
higher-order function. This time you can think of it (specifically the variant for value-based grouping) as a
function whose arguments are the sequence to be grouped, a function to calculate the grouping key, and a
function to be evaluated once for each group of items in the input sequence. So that XQuery users can take
advantage of the grouping facilities in Saxon, I have again provided a higher-order extension function in
Saxon 8.1 that provides this capability: its name is saxon:for-each-group() [Saxonica, 2004]. As with
analyze-string, it is slightly clumsier to use than the custom syntax provided in XSLT 2.0, but again
shows how much more power there would be in the language if higher-order functions were a standard
feature.

2.4 Schema-aware processing 

The most radical difference between XSLT 2.0 and XSLT 1.0 is that the language has become strongly
typed, with a type system based on XML Schema. This has been done in such a way that untyped 
(schemaless) processing is still possible as a fallback. There are many reasons this change has taken place, 
and much debate about the desirability of making such a radical change, especially in view of the fact that 
XML Schema is widely criticized both for its complexity and for the limitations in its capability. I would 
like to concentrate here, however, on its impact for writing up-conversion applications. 

Since up-conversion often starts with an input file that is not XML, it is unlikely that an XML Schema will 
exist to describe its structure. Fortunately this is not a problem: XSLT is still perfectly happy to work with 
untyped, schemaless data. 

I have often found that it is best to structure an up-conversion as a sequence of two (maybe more) 
transformations. The first transformation takes the raw input data in whatever legacy format it arrives in, 
and translates it to an XML representation that is as close to the original structure as possible, consistent 
with it being XML. The second transformation takes this raw XML and translates it to the desired target 
XML vocabulary. 



The target vocabulary typically represents XML that is designed to have significant visibility: it may be
long-lived, widely-shared, or both. Therefore, it is very likely that there will exist an XML Schema for this
vocabulary. The schema-aware capabilities of XSLT that are relevant to up-conversion therefore tend to be
those that are concerned with validating the result tree, rather than those concerned with processing the
source. In the case study I will show how this validation assisted with the development process for creating
correct XSLT transformations. The case study in this paper is an artificial one, constructed largely
for pedagogic purposes, but I have had the same experiences in a real project involving the capture of
human resources data from Excel spreadsheets for transfer into an XML database.



3. An up-conversion case study: GEDCOM 

In this second part of this paper we will look at how the constructs introduced in the previous section are 
used in a practical example of an up-conversion exercise. 

3.1 Description of the Problem 

Genealogical data is interesting for a number of reasons. Genealogy is one of the most popular applications 
of the web for millions of people, and its success relies on the ability to exchange data between different 
application packages. The data itself is quite complex, for two reasons: the variety of information that 
people want to record, and the need to capture uncertain information and conflicting versions of events. 
For many years genealogical data has been exchanged using a format called GEDCOM [LPS, 1996],
devised by the Church of Jesus Christ of Latter-Day Saints (the Mormons). GEDCOM 5.5 uses a
hierarchic record format rather in the style of a COBOL data definition, typified by the following entry:

0 @I53@ INDI
1 NAME Michael Howard /KAY/
1 SEX M
1 BIRT
2 DATE 11 OCT 1951
2 PLAC Hannover, Germany
3 MAP
4 LATI N52
4 LONG E9
1 OCCU Software Designer
2 DATE FROM 1975 TO 2004
1 EDUC Postgraduate
2 DATE FROM 1969 TO 1975
2 PLAC Cambridge, England
3 MAP
4 LATI N52
4 LONG E0
2 NOTE PhD in Computer Science
1 FAMS @F233@
1 FAMC @F221@

The @I53@ field is a record identifier, and the values @F233@ and @F221@ are pointers to other records
(specifically, the record describing the family in which this individual is a parent, and the record describing
the family in which this individual is a child).

This can of course be directly translated to an XML syntax, such as this: 



<INDI>
  <NAME>Michael Howard /KAY/</NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>11 OCT 1951</DATE>
    <PLAC>Hannover, Germany
      <MAP>
        <LATI>N52</LATI>
        <LONG>E9</LONG>
      </MAP>
    </PLAC>
  </BIRT>
  <OCCU>Software Designer
    <DATE>FROM 1975 TO 2004</DATE>
  </OCCU>
  <EDUC>Postgraduate
    <DATE>FROM 1969 TO 1975</DATE>
    <PLAC>Cambridge, England
      <MAP>
        <LATI>N52</LATI>
        <LONG>E0</LONG>
      </MAP>
    </PLAC>
    <NOTE>PhD in Computer Science</NOTE>
  </EDUC>
  <FAMS REF="F233"/>
  <FAMC REF="F221"/>
</INDI>

The first stage of our up-conversion application will be to convert the data into this form. After that we will 
see how to convert it further to the actual target XML vocabulary defined by the proposed GEDCOM- 
XML standard. 

3.2 Stage One: Conversion to Raw XML 

In my book XSLT Programmer's Reference (including the latest edition for XSLT 2.0) I describe how to 
perform this step by writing a GEDCOM parser in Java. The fact is, however, that it can be coded entirely 
in XSLT 2.0, and that the XSLT 2.0 code is actually shorter than the Java implementation. Let's see what it 
looks like. 

First we have to read the input file, which we can do like this: 



<xsl:param name="input" as="xs:string" required="yes"/>

<xsl:variable name="input-text"
              as="xs:string"
              select="unparsed-text($input, 'iso-8859-1')"/>

(I've actually cheated here. GEDCOM requires files to be encoded in a character set called ANSEL,
otherwise ANSI Z39.47-1985, which is used for almost no other purpose. If ANSEL were a mainstream
character encoding, it could be specified in the second argument of the unparsed-text() function call. In
practice, however, it is rather unlikely that any XSLT 2.0 processor would support this encoding natively.
Therefore, the conversion from ANSEL to a mainstream character encoding will still have to be done in a
pre-processing phase.)

The next stage is to split the input into lines, which can be done using the XPath 2.0 tokenize() function.
Since the unparsed-text() function does not normalize line endings (this might yet change), the regular
expression for matching the separator between tokens accepts both UNIX and Windows line endings. The
result is a sequence of strings, one for each line of the input file:

<xsl:variable name="lines"
              as="xs:string*"
              select="tokenize($input-text, '\r?\n')"/>

Now we need to parse the individual lines. Each line in a GEDCOM file has up to five fields: a level 
number, an identifier, a tag, a cross-reference, and a value. We will create an XML line element 
representing the contents of the line, using attributes to represent each of these five components: 

<xsl:variable name="parsed-lines" as="element(line)*">
  <xsl:for-each select="$lines">
    <xsl:analyze-string select="." flags="x"
        regex="^([0-9]+)\s+
               (@([A-Za-z0-9]+)@)?\s*
               ([A-Za-z]*)?\s*
               (@([A-Za-z0-9]+)@)?
               (.*)$">
      <xsl:matching-substring>
        <line level="{regex-group(1)}"
              ID="{regex-group(3)}"
              tag="{regex-group(4)}"
              REF="{regex-group(6)}"
              text="{regex-group(7)}"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:message>
          Non-matching line "<xsl:value-of select="."/>"
        </xsl:message>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:for-each>
</xsl:variable>

Note first the as attribute on the xsl:variable declaration. I have consistently been declaring the types of
my variables: this helps to pick up programming errors and it documents the stylesheet for the reader. I can
do this even with a non-schema-aware stylesheet: the form element(line)* indicates that the variable
holds a sequence of elements whose name is line. I could further constrain them to conform to a line
element declaration in an XML schema by writing schema-element(line)*, but I've chosen not to do
that here, because it's too much effort to create a schema to describe this transient data structure.

The actual content of the elements is constructed by analyzing the text of the input GEDCOM line using a
regular expression. The attribute flags="x" allows the regex to be split into multiple lines for readability.
The five lines of the regex correspond to the five fields that may be present. I describe this usage of
xsl:analyze-string as a "single-match" usage, because the idea is that the regular expression matches
the entire input string exactly once, and the xsl:non-matching-substring instruction is used only to
catch errors. Within the xsl:matching-substring instruction, the content of the line is picked apart using
the regex-group() function, which returns the part of the matching substring that matched the n'th
parenthesized subexpression within the regex. If the relevant part of the regex wasn't matched (for
example, if the optional identifier was absent) then this returns a zero-length string, and our XSLT code
then creates a zero-length attribute.

So we now have a sequence of XML elements each representing one line of the GEDCOM file, each 
containing attributes to represent the contents of the five fields in the input. The next stage is to convert 
this flat sequence into a hierarchy, in which level 2 lines (for example) turn into XML elements that 
contain the corresponding level 3 lines. 

Any problem that involves adding hierarchic levels to the result tree that were not present in the source
tree can be regarded as a grouping problem, and it should therefore be no surprise that we tackle it using
the xsl:for-each-group instruction. This time a group consists of a level N element together with the
following elements up to the next one at level N. So this is a positional grouping rather than a value-based
grouping. The option that we use to tackle this is the group-starting-with attribute, whose value is a
match pattern that is used to recognize the first element in each group.

A single application of xsl:for-each-group creates one extra level in the result tree. In this example, we
have a variable number of levels, so we want to apply the instruction a variable number of times. First we
group the overall sequence of line elements so that each level 0 line starts a new group. Within this group,
we perform a further grouping so that each level 1 line starts a new group, and so on up to the maximum
depth of the hierarchy. As one might expect, the process is recursive: we write a recursive template that
performs the grouping at level N, and that calls itself to perform the level N+1 grouping. This is what it
looks like:

<xsl:template name="process-level">
  <xsl:param name="population" required="yes" as="element()*"/>
  <xsl:param name="level" required="yes" as="xs:integer"/>
  <xsl:for-each-group select="$population"
        group-starting-with="*[xs:integer(@level) eq $level]">
    <xsl:element name="{@tag}">
      <xsl:copy-of select="@ID[string(.)], @REF[string(.)]"/>
      <xsl:value-of select="normalize-space(@text)"/>
      <xsl:call-template name="process-level">
        <xsl:with-param name="population"
                        select="current-group()[position() != 1]"/>
        <xsl:with-param name="level"
                        select="$level + 1"/>
      </xsl:call-template>
    </xsl:element>
  </xsl:for-each-group>
</xsl:template>

When this is called to process all the line elements with the $level parameter set to zero, it forms one
group for each line having the attribute level="0", containing that line and all the following lines up to
the next one with level="0". It then processes each of these groups by creating an element to represent
the level 0 line (the name of this element is taken from the GEDCOM tag, and its ID and REF attributes
are copied unless they are empty), and constructs the content of this new element by means of a recursive
call, processing all elements in the group except the first, and looking this time for level 1 lines as the ones
that start a new group. The process continues until there are no lines at the next level (the for-each-group
instruction does nothing if the population to be grouped is empty).

The remaining code in the stylesheet simply invokes this recursive template to process all the lines at level 
0: 

<xsl:template name="main">
  <xsl:call-template name="process-level">
    <xsl:with-param name="population"
                    select="$parsed-lines"/>
    <xsl:with-param name="level"
                    select="0"/>
  </xsl:call-template>
</xsl:template>

This main template represents the entry point to the stylesheet. There is no match="/" template rule, 
because there is no source XML document with a root node to be matched; instead, XSLT 2.0 allows a 
transformation to be invoked by specifying the name of a named template where execution is to start. I use 
the name main as a matter of convention. 

We have now converted the GEDCOM data to XML. The next step is to convert it to the actual XML 
vocabulary that the target application requires. 

3.3 Stage Two: Conversion to the Target Schema 

Like many up-conversion problems, the GEDCOM problem is best solved in two stages: the first stage is 
essentially a syntactic transformation of the raw data into XML, and the second stage is a semantic 
transformation to a different data model. 

At the same time as moving to XML, the GEDCOM designers decided it was time to fix some
long-standing deficiencies in the data model. The draft GEDCOM 6.0 specification [LPS, 2002] therefore not
only moves from ANSEL character encoding to Unicode, and from COBOL-like level numbers to nested 
XML tags, it also changes the structure of the data. Events, for example, are now primary objects in their 
own right, rather than being always subsidiary to an individual or family. This reflects the fact that there is 
often uncertainty as to whether two events involve the same individual (rather than two distinct individuals 
having the same name), and it also makes it easier to record all the individuals associated with an event - 
for example, the witnesses at a marriage, or the godparents at a christening. 

The transformation of GEDCOM 5.5 files to "raw XML", as described in the previous section, is therefore 
followed by a second transformation, this time to XML that conforms to the target schema defined by 
GEDCOM 6.0. (I'm taking it as read here that GEDCOM 6.0 exists and is stable and is worth adopting as a
target. This idealizes the actual state of affairs, but the debate isn't relevant to this paper.) 

Multi-phase transformations can be done in either of two ways: using a single stylesheet (typically using 
different modes for the two phases) or using one stylesheet for each phase. I usually find it easier to
develop them as multiple stylesheets, and then integrate them later into a production application.
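The single-stylesheet approach can be sketched as follows, with the first phase building a temporary tree in a variable and the second phase processing that tree in a different mode. This is an illustrative outline rather than the GEDCOM stylesheet: the mode names phase1 and phase2 and the item, entry and result elements are hypothetical.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <!-- Phase one: build a temporary tree from the source document -->
    <xsl:variable name="phase1">
      <xsl:apply-templates select="*/item" mode="phase1"/>
    </xsl:variable>
    <!-- Phase two: transform the temporary tree into the final result -->
    <result>
      <xsl:apply-templates select="$phase1/entry" mode="phase2"/>
    </result>
  </xsl:template>

  <!-- Phase one rule (hypothetical): tidy whitespace, rename item to entry -->
  <xsl:template match="item" mode="phase1">
    <entry><xsl:value-of select="normalize-space(.)"/></entry>
  </xsl:template>

  <!-- Phase two rule (hypothetical): number each entry -->
  <xsl:template match="entry" mode="phase2">
    <entry n="{position()}"><xsl:value-of select="."/></entry>
  </xsl:template>

</xsl:stylesheet>
```

Keeping each phase in its own mode means that the two sets of template rules cannot interfere with each other, which is also what makes it straightforward to merge separately developed stylesheets later.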

The second transformation is rather more conventional than the first, because it starts with XML as its 
input. I've presented the full stylesheet in XSLT 2.0 Programmer's Reference, and I won't repeat it here in 
full. What I would like to draw out, however, is the impact of making this stylesheet schema-aware. 

The first stylesheet, presented in the previous section, didn't use an XML schema. The input isn't XML, so 
it clearly has no schema; and the output uses a local transient XML vocabulary where the effort of writing 
a schema probably isn't worthwhile. However, for the second stylesheet, the aim is to produce output that 
conforms to a recognized standard XML vocabulary, for which an XML Schema exists, and we clearly 
want to have as much confidence as we can that the stylesheet output will always conform to this target 
schema. 

With XSLT 1.0, the way you achieve this is to run your stylesheet against as many test cases as you can, 
and validate the output of each test case against the target schema. If validation errors are reported, you 
then have to debug the stylesheet to find out why it produced incorrect output in this particular case. 

It would be far better if one could determine statically, purely from examination of the stylesheet, that its 
output will be correct. In practice this is unlikely to be fully achievable, because of the highly dynamic 
nature of XSLT template rules. However, there are many errors that could in principle be detected 
statically, and each error that is found this way makes a significant contribution to easing the testing and 
debugging burden. For example, here is an extract of the second-phase GEDCOM stylesheet: 



<xsl:result-document validation="strict">
  <GEDCOM>
    <HeaderRec>
      <FileCreation Date="{format-date(current-date(),
                          '[D1] [MN,*-3] [Y0001]')}"/>
      <Submitter>
        <Link Target="ContactRec" Ref="Contact-Submitter"/>
      </Submitter>
    </HeaderRec>
    <xsl:call-template name="families"/>
    <xsl:call-template name="individuals"/>
    <xsl:call-template name="events"/>
    <ContactRec Id="Contact-Submitter">
      <Name><xsl:value-of select="$submitter"/></Name>
    </ContactRec>
  </GEDCOM>
</xsl:result-document>

One can see many potential errors that could be detected statically by the stylesheet compiler. It can check
that there is a schema definition of the GEDCOM element, and that HeaderRec and ContactRec are permitted
respectively as the first and last child elements of the GEDCOM element. It can check similarly that the
elements within the HeaderRec are allowed to appear where they do, that they are allowed to have the
appropriate attributes, and that none of these elements have required attributes which the stylesheet does
not generate. In some cases the compiler can also check that the textual content of elements and attributes
is appropriate to their type. The analysis can extend beyond the fragment shown here to the three named
templates invoked by this fragment; for example, if the call on the individuals template preceded that on
the families template, then the compiler could deduce that the stylesheet was outputting IndividualRec
elements ahead of FamilyRec elements, which the schema does not allow.

As programmers, we are all familiar with the fact that errors detected at compile-time are much quicker to 
find and to fix than errors detected at run-time. This is as true for XSLT as for any other programming 
language. 

Currently the only schema-aware XSLT processor available is my own Saxon product, and the current 
release (8.0) does not yet do the kind of static checking described above. Even run-time checking, 
however, can pay substantial dividends. For example, one error that I made during development was to 
write an attribute of a literal result element as Id="@ID" instead of Id="{@ID}". Ordinarily, this would
cause the result document to contain the attribute value Id="@ID". When the programmer gets round to
validating the output (a stage which is often omitted during development and testing) this would reveal an
error, because the Id attribute is declared as having type xs:ID, and an @ character is not allowed in values
of this type. Running with a schema-aware processor, this error was reported as soon as the offending code 
in the stylesheet was executed, with the incorrect line in the stylesheet being accurately pinpointed. 
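The difference between the two forms is easy to overlook, so it may help to see them side by side. The sketch below assumes that the first-phase output contains INDI elements carrying an ID attribute; the mode names are hypothetical and exist only to let both versions sit in one stylesheet.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Mistaken form: without curly braces the attribute value is the
       literal string "@ID", which is not a valid value of type xs:ID -->
  <xsl:template match="INDI" mode="mistaken">
    <IndividualRec Id="@ID"/>
  </xsl:template>

  <!-- Intended form: the attribute value template {@ID} is evaluated as
       an XPath expression, so Id receives the value of the source ID
       attribute -->
  <xsl:template match="INDI" mode="intended">
    <IndividualRec Id="{@ID}"/>
  </xsl:template>

</xsl:stylesheet>
```

Both templates are legal XSLT, which is why a processor that is not schema-aware accepts the first without complaint. Only validation of the result against the target schema exposes it.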

I actually found that while developing this and other similar stylesheets, the number of errors detected by 
validation of result trees was so large that it became a little frustrating. Sometimes one actually wants to 
develop a stylesheet "top-down", getting the broad structure of the output right first, and focusing on the 
detail later. As a response to this experience, Saxon 8.1 allows multiple validation errors in the output to be 
reported in a single run, and it allows you to see the (invalid) result tree that was generated, along with 
comments inserted into the XML showing where it is invalid and which stylesheet instructions need to be 
changed to fix the errors. This provides another of the benefits normally associated with compile-time 
errors, the ability to report many errors in a single run. 

Like other new features in XSLT 2.0, such as xsl:analyze-string and xsl:for-each-group, the
facility to validate result documents on-the-fly is useful for a wide range of applications, of which
up-conversion applications are just one example. But taken together, these features make a dramatic difference
to the ease of developing up-conversion applications when compared with XSLT 1.0.



4. Conclusions 

The first part of this paper described four specific features of XSLT 2.0 that make it highly suitable for 
writing up-conversion applications, namely: 
• The unparsed-text() function

• Regular expression processing 

• Grouping facilities 

• Schema-aware processing 

The second half of the paper showed how these features can be used in a practical up-conversion exercise, 
the translation of GEDCOM 5.5 genealogical data to the proposed GEDCOM 6.0 XML vocabulary. 

XSLT 1.0 has been widely deployed to achieve both XML-to-XML and XML-to-text transformations. The 
conclusion of this paper is that XSLT 2.0 is also highly suited to a wide range of text-to-XML applications, 
thus greatly increasing the scope of applicability of the language. 